As a longtime, avid fútbol fan, I am interested in exploring the effect of birth dates on a player’s career. This question was prompted by reading Malcolm Gladwell’s Outliers, in which he supports the relative age theory with striking examples of professional athletes, particularly in hockey, whose birthdays are disproportionately clustered in the first three months of the year. One might even say the clustering is statistically significant. The theory is that during the adolescent years of development, children born just after the cutoff date (e.g. January 1) are more physically developed than their counterparts born in December. This results in promotion to more competitive age groups and more attention in development. Gladwell describes this as “accumulative advantage” or “The Matthew Effect,” named after Matthew 25:29 and often summarized by the adage “the rich get richer and the poor get poorer” (Outliers, 2008).
“Does a footballer’s birthday affect their chances of reaching the professional level in European Football?”
“Does the intensity of The Matthew Effect increase, decrease, or stay the same when examining different tiers of football?”
The cutoff dates taken from Sports Tours for the top five leagues are as follows:
cutoff_dates <- data.frame(league = c("Premier League", "La Liga", "Bundesliga", "Ligue 1", "Serie A"),
                           cutoff_date = c("September 1", "January 1", "January 1", "January 1", "January 1"))
cutoff_dates %>% as_tibble() %>%
  formattable(align = 'l')
| league | cutoff_date |
|---|---|
| Premier League | September 1 |
| La Liga | January 1 |
| Bundesliga | January 1 |
| Ligue 1 | January 1 |
| Serie A | January 1 |
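To make the idea of being born "near the cutoff" concrete, here is a minimal sketch that measures how many days after the most recent cutoff a player was born. The helper `days_after_cutoff()` and the toy players are hypothetical and are not part of this report's data; the cutoff months are taken from the table above.

```r
library(dplyr)
library(lubridate)

# Hypothetical helper: days between the most recent cutoff date and a birth date.
# A value near 0 means the player is among the oldest in the age group.
days_after_cutoff <- function(dob, cutoff_month, cutoff_day = 1) {
  dob <- as_date(dob)
  # Cutoff that falls in the same calendar year as the birth date
  cutoff <- make_date(year(dob), cutoff_month, cutoff_day)
  # If the birthday precedes that cutoff, count from the previous year's cutoff
  cutoff <- if_else(dob < cutoff,
                    make_date(year(dob) - 1, cutoff_month, cutoff_day),
                    cutoff)
  as.integer(dob - cutoff)
}

# Toy players (made up for illustration only)
toy_players <- tibble(
  player = c("A", "B", "C"),
  league = c("Premier League", "Premier League", "La Liga"),
  dob    = ymd(c("1995-09-01", "1995-08-31", "1995-01-15"))
) %>%
  mutate(cutoff_month = if_else(league == "Premier League", 9, 1),
         days_after   = days_after_cutoff(dob, cutoff_month))

toy_players
```

Under a September 1 cutoff, player A is the oldest possible member of the cohort and player B the youngest, even though their birthdays are one day apart.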
Below is data scraped from statscrew.com:
url <- "https://www.statscrew.com/minorhockey/roster/t-10853/y-2007"
paths_allowed(url)
##
www.statscrew.com
## [1] TRUE
# Download the roster page and parse its table into a data frame
gladwell <- read_html(url)
gladwell <- gladwell %>%
  html_nodes("table") %>%
  html_table() %>%
  as.data.frame() %>%
  rename(player = Player,
         position = Pos.,
         dob = Birth.Date,
         height = Height,
         weight = Weight,
         sc = S.C,
         hometown = Hometown) %>%
  # Split a height such as 6'1" into feet and inches, then convert to centimetres
  mutate(foot = str_extract(height, "^\\d+'"),
         inch = str_extract(height, "\\d+\"$"),
         foot = as.numeric(str_remove(foot, "[^\\d]")),
         inch = as.numeric(str_remove(inch, "[^\\d]")),
         height_cm = cm(foot * 12) + cm(inch)) %>%
  # Column 3 is dob and column 1 is player
  separate(3, into = c("month", "day", "year")) %>%
  separate(1, into = c("first_name", "last_name")) %>%
  # Rebuild dob as a Date from the month name, day, and year
  mutate(month_num = as.character(as.integer(factor(month, levels = month.name))),
         dob = make_date(year,
                         month_num,
                         day)) %>%
  relocate(dob, .before = month) %>%
  filter(!is.na(dob))
# Reorder the rows by position and keep the columns shown in the book
gladwell <- gladwell %>%
  slice(4,34,12,13,9,31,30,19,32,25,10,14,23,6,29,1,27,11,24,25,16,5,3,15,17) %>%
  mutate(name = paste(first_name, last_name),
         birthdate = paste(month_num, day, year)) %>%
  select(name, position, sc, height, weight, birthdate, hometown) %>%
  head(3)
The table above should closely resemble the one published in Outliers on pages 20 and 21.
Much research has already been done on the relative age effect, which is particularly prominent in the lower age groups. Helsen, Winckel, and Williams conducted a thorough analysis of 2,175 fútbol players competing in international club competitions within Europe. Their results showed an “over-representation of players born in the first quarter of the selection year (from January to March) for all the national youth selections at the under-15 (U-15), U-16, U-17 and U-18 age categories” (Helsen, Winckel, and Williams, 2005). The purpose of this report is to mirror their analysis while substituting an older demographic of players as the subjects of the study. Following the structure of their study, the first part of this report will simply view the distribution of players born in each month. The second part will explore in more detail characteristics such as goals and assists contributions, FIFA score, and market value in the 2020-2021 season, and look for any patterns that may have favored birthdays closer to the cutoff date.
Helsen, Winckel, and Williams concluded in their analysis of youth players that a significantly larger proportion of footballers were born in the first three months of the selection year than in the last three, but that once players reached the professional stage, factors other than birth date were better indicators of their market value. I hypothesize that the same will be true for the professional footballers I examine.
Differences between the results of the two studies should not be over-analyzed, as there are many confounding variables. The datasets come from different time periods: philosophies at the youth academy level might have changed, and the solutions that Helsen, Winckel, and Williams proposed to reduce the relative age effect might already have been implemented. Coincidentally, some players might appear in both studies. A 12-year-old in 2005 would be 28 years old today, and an 18-year-old (the maximum age observed in Helsen, Winckel, and Williams’ study) would be 34.
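One way to check whether births are over-represented in a quarter is a chi-square goodness-of-fit test. The sketch below is only an illustration of that idea: the counts are invented (they are not results from this report or from Helsen, Winckel, and Williams), and the expected shares assume births are spread evenly across the days of a non-leap year.

```r
# Invented counts of players born in each quarter (illustration only)
births_by_quarter <- c("Jan-Mar" = 130, "Apr-Jun" = 110,
                       "Jul-Sep" = 90,  "Oct-Dec" = 70)

# Assumed null hypothesis: births are uniform over the calendar,
# so each quarter's share is its number of days divided by 365
expected_share <- c(90, 91, 92, 92) / 365

# Goodness-of-fit test of the observed counts against the expected shares
test_result <- chisq.test(births_by_quarter, p = expected_share)

test_result$statistic  # chi-square statistic
test_result$parameter  # degrees of freedom (number of quarters minus 1)
test_result$p.value    # a small p-value suggests births are not evenly spread
```

A real analysis would ideally replace the uniform assumption with birth rates from the general population, since births are not perfectly even across months.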
I intend to find data using CSV files provided by Kaggle, FIFA World Cup, and DataHub. Depending on the availability of data and the ease of accessing a complete list of players, I intend to be as inclusive as possible within the top 5 leagues in Europe: the Premier League (ENG), La Liga (ESP), Bundesliga (GER), Serie A (ITA), and Ligue 1 (FRA). Each player will be treated as an individual case and, for the purposes of tidy data, merit its own row. Other variables to explore are goals-to-game ratio, disciplinary record, market value over time, nationality, and the difference between club and country dynamics. To explore the last variable, I will isolate a data set of players who competed in the World Cup for the nation they represent. This will include many more players who haven’t reached the top 5 leagues but have still competed for their country.
Since the FIFA data set does not record a player’s goals and assists contributions or other commonly analyzed measures of talent, I intend to join it with another data set that lists players alongside statistics relating to their match performances. I intend to scrape these statistics from a variety of football statistics websites, where the website permits it.
I found the FIFA data set to be comprehensive in the detail of its player ratings and in the number of players it includes. However, I found it incomplete in that it excludes data points that measure a player’s success in games, such as goals, assists, chances created, duels won, and disciplinary record. The CSV files were already in tidy format, with each player representing a single case and each variable describing the player assigned to its own column. The package also includes data sets for a range of years with matching columns, which makes them well suited to bind_rows() if so desired. Since this option introduces time, we could potentially explore a player’s FIFA score over time.
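As a rough sketch of the stacking and joining described above, the example below binds two yearly tables with `bind_rows()` and then joins match statistics onto them. All tables here are toy data, and the `fifa_year` and `goals` columns, as well as the choice of `sofifa_id` as a join key, are assumptions for illustration; a scraped statistics table would more likely need to be matched on player name.

```r
library(dplyr)

# Toy yearly tables with identical columns (made up for illustration)
players_20 <- tibble(sofifa_id = c(1, 2),
                     long_name = c("Player One", "Player Two"),
                     overall   = c(80, 75))
players_21 <- tibble(sofifa_id = c(1, 2),
                     long_name = c("Player One", "Player Two"),
                     overall   = c(82, 74))

# Stack the years; .id records which list element each row came from
all_years <- bind_rows(list(`2020` = players_20, `2021` = players_21),
                       .id = "fifa_year") %>%
  mutate(fifa_year = as.integer(fifa_year))

# Hypothetical match statistics, available for one year only
match_stats <- tibble(sofifa_id = c(1, 2),
                      fifa_year = c(2021L, 2021L),
                      goals     = c(10, 3))

# left_join keeps every player-year and fills missing statistics with NA
joined <- all_years %>%
  left_join(match_stats, by = c("sofifa_id", "fifa_year"))

joined
```

Keeping the year as a column is what makes it possible to follow one player's score across editions later on.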
I intend to use R to create tidy tables, apply statistical reasoning, and communicate my results effectively with both textual analysis and visualizations.
The CSV file is taken from Wikipedia and contains the country codes for each country. After renaming the columns to alphacode3 and en_short_name, I selected those columns and assigned the table to country_codes, to be joined later with other datasets in this report.
#Loading and Cleaning data set
country_codes <- read_csv("data/country_codes.csv")
## Rows: 246 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): English short name lower case, Alpha-2 code, Alpha-3 code, Numeric ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Cleaning dataset
country_codes <- country_codes %>%
rename(alphacode3 = `Alpha-3 code`,
en_short_name = `English short name lower case`) %>%
select(en_short_name, alphacode3)
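The later join could look roughly like the sketch below, which uses toy tables rather than the real files. The `nationality` key is an assumption about which player column holds the country name. One thing worth checking after such a join is which nationalities fail to match: football nations such as England are not separate entries in ISO 3166, so they would be left without a code.

```r
library(dplyr)

# Toy stand-ins for country_codes and a player table (illustration only)
country_codes_toy <- tibble(en_short_name = c("France", "Spain"),
                            alphacode3    = c("FRA", "ESP"))
players_toy <- tibble(long_name   = c("Player One", "Player Two", "Player Three"),
                      nationality = c("France", "Spain", "England"))

# Attach the three-letter code to each player by country name
players_with_codes <- players_toy %>%
  left_join(country_codes_toy, by = c("nationality" = "en_short_name"))

# Players whose nationality has no matching country code
unmatched <- players_with_codes %>%
  filter(is.na(alphacode3))

unmatched
```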
I have also intentionally selected only the variables that are of primary interest in our first round of exploration with birth dates. Consequently, the majority of the columns (e.g. weak_foot and skill_moves) have been temporarily set aside.
Before we dive into the data, I intend to “clean” it and ensure that my variables have the proper vector classes for future calculations.
The following lists the specific actions:
1. For plotting purposes, I created the numeric variables year, month_num, and day from dob, along with month, a factor of abbreviated month names. I also created day_of_year, which calculates the day of the year from dob. For example, January 1 is day 1 and December 31 is day 365 (366 in a leap year).
2. I created two variables, quarter and quarter_num, which give the quarter of the year in which the player was born. The levels for quarter are “Jan-Mar”, “Apr-Jun”, “July-Sep”, and “Oct-Dec”. The levels for quarter_num are 1, 2, 3, and 4.
3. I created a ranking column, fifa_rank, from the row order of the file (which assumes the file is sorted by FIFA score) and relocated the newly made year through quarter_num variables (see steps 1 and 2) to be placed after long_name.
4. I selected only the columns I deemed to be of interest in this report. Since the original players_21.csv file contains 106 variables, I decided to exclude the majority of the variables that relate to the FIFA score, instead opting to use overall as an indicator of the player’s talent. One last step in selecting columns was renaming team_jersey_number to jersey, club_name to club, league_name to league, league_rank to league_tier_number, and overall to fifa_score.
fifa_cleaning <- function(csv_file){
  players <- read_csv(csv_file)
  players <- players %>%
    # Date parts derived from dob
    mutate(year = year(ymd(dob)),
           month_num = month(ymd(dob)),
           day = day(ymd(dob)),
           month = factor(month.abb[month_num], levels = month.abb),
           day_of_year = yday(dob),
           # Birth quarter as a label and as a number
           quarter = ifelse(month_num %in% 1:3, "Jan-Mar",
                     ifelse(month_num %in% 4:6, "Apr-Jun",
                     ifelse(month_num %in% 7:9, "July-Sep",
                            "Oct-Dec"))),
           quarter_num = ifelse(month_num %in% 1:3, 1,
                         ifelse(month_num %in% 4:6, 2,
                         ifelse(month_num %in% 7:9, 3,
                                4))),
           # Rank follows the row order of the file
           fifa_rank = row_number()) %>%
    relocate(year:quarter_num, .after = long_name) %>%
    # Keep and rename only the columns of interest
    select(fifa_rank, sofifa_id,
           jersey = team_jersey_number,
           long_name:nationality,
           club = club_name,
           league = league_name,
           league_tier_number = league_rank,
           fifa_score = overall,
           value_eur,
           wage_eur,
           joined)
  players
}
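As a quick sanity check of the date-derived variables, the sketch below rebuilds them on three toy birth dates. It is an alternative formulation, not the function used in this report: it relies on `lubridate::quarter()` instead of nested `ifelse()` calls, and the `quarter_labels` vector is a hypothetical helper that reuses the same labels as the cleaning function.

```r
library(dplyr)
library(lubridate)

# Same labels as the cleaning function, kept in calendar order
quarter_labels <- c("Jan-Mar", "Apr-Jun", "July-Sep", "Oct-Dec")

# Toy birth dates (illustration only); 1992 and 1996 are leap years
toy_births <- tibble(dob = ymd(c("1990-01-01", "1992-06-30", "1996-12-31"))) %>%
  mutate(month_num   = month(dob),
         day_of_year = yday(dob),
         # quarter() returns 1 to 4 for calendar quarters
         quarter_num = lubridate::quarter(dob),
         # Ordered factor levels keep plots in calendar order
         quarter     = factor(quarter_labels[quarter_num],
                              levels = quarter_labels))

toy_births
```

Storing quarter as a factor with explicit levels is a design choice that prevents plots from sorting the quarters alphabetically.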
fifa_players_15 <- fifa_cleaning("data/players_15.csv")
## Rows: 16155 Columns: 106
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (42): player_url, short_name, long_name, nationality, club_name, league...
## dbl (60): sofifa_id, age, height_cm, weight_kg, league_rank, overall, poten...
## lgl (2): release_clause_eur, mentality_composure
## date (2): dob, joined
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fifa_players_16 <- fifa_cleaning("data/players_16.csv")
## Rows: 15623 Columns: 106
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (42): player_url, short_name, long_name, nationality, club_name, league...
## dbl (60): sofifa_id, age, height_cm, weight_kg, league_rank, overall, poten...
## lgl (2): release_clause_eur, mentality_composure
## date (2): dob, joined
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fifa_players_17 <- fifa_cleaning("data/players_17.csv")
## Rows: 17597 Columns: 106
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (42): player_url, short_name, long_name, nationality, club_name, league...
## dbl (61): sofifa_id, age, height_cm, weight_kg, league_rank, overall, poten...
## lgl (1): release_clause_eur
## date (2): dob, joined
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fifa_players_18 <- fifa_cleaning("data/players_18.csv")
## Rows: 17954 Columns: 106
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (42): player_url, short_name, long_name, nationality, club_name, league...
## dbl (62): sofifa_id, age, height_cm, weight_kg, league_rank, overall, poten...
## date (2): dob, joined
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fifa_players_19 <- fifa_cleaning("data/players_19.csv")
## Rows: 18085 Columns: 106
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (42): player_url, short_name, long_name, nationality, club_name, league...
## dbl (62): sofifa_id, age, height_cm, weight_kg, league_rank, overall, poten...
## date (2): dob, joined
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fifa_players_20 <- fifa_cleaning("data/players_20.csv")
## Rows: 18483 Columns: 106
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (42): player_url, short_name, long_name, nationality, club_name, league...
## dbl (61): sofifa_id, age, height_cm, weight_kg, league_rank, overall, poten...
## lgl (1): defending_marking
## date (2): dob, joined
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fifa_players_21 <- fifa_cleaning("data/players_21.csv")
## Rows: 18944 Columns: 106
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (42): player_url, short_name, long_name, nationality, club_name, league...
## dbl (61): sofifa_id, age, height_cm, weight_kg, league_rank, overall, poten...
## lgl (1): defending_marking
## date (2): dob, joined
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Stack all editions into one table (one row per player per edition)
all_fifa_players <- fifa_players_15 %>%
  bind_rows(fifa_players_16,
            fifa_players_17,
            fifa_players_18,
            fifa_players_19,
            fifa_players_20,
            fifa_players_21)
# Keep the first occurrence of each player name across editions
all_distinct_fifa_players <- all_fifa_players %>%
  distinct(long_name, .keep_all = TRUE) %>%
  select(fifa_rank:quarter_num,
         nationality:fifa_score)
# Interactive, searchable table of every row
all_fifa_players %>%
  reactable(filterable = TRUE,
            searchable = TRUE,
            minRows = 5)